There is no denying the importance of education. At the individual level, a strong education is important for personal growth, acquiring knowledge, and developing critical thinking skills. At the societal level, educational outcomes are highly correlated with better health and higher earnings. However, in the United States, many factors can influence the quality of education and students' educational success. Understanding how these socioeconomic and political factors affect academic achievement outcomes can better inform future educational policies and programs.
Our group was inspired to pursue this project after many of us took a statistics course at the Harvard Graduate School of Education during the Spring of 2021. For this class, we often applied the methods we were learning to data from The Educational Opportunity Project at Stanford University. This dataset, referred to as the Stanford Education Data Archive (SEDA), provides detailed information on educational outcomes at various levels (school, district, county, community, and state). We were motivated to use this first national database of academic performance to ask questions about predictors of academic achievement and school funding between states and at the national level.
In the first part of our project (Part 1), we aim to understand the variables that predict academic achievement on a district level. This section of our project has three components. First, we will use exploratory data analysis and machine learning modeling to determine key predictors of academic achievement on which to focus. Next we will take a closer look at socio-economic status, unemployment, and poverty as predictors of academic achievement, using Washington and Massachusetts as case studies. Finally, we will explore how receiving free lunch at school predicts academic success.
To summarize, we are specifically aiming to answer the following questions:
What are some of the most important predictors of academic achievement at the national level?
How are unemployment, poverty, and socioeconomic status associated with academic achievement in Massachusetts, Washington, Nebraska, and California?
How is the percentage of students receiving free or reduced lunch associated with academic achievement in districts at the national level?
Throughout this RMarkdown, we will discuss in depth how our thought process surrounding these questions developed and evolved over the course of the project. For example, the results of the exploratory data analysis and machine learning models helped to inform areas of focus in our project.
In our project, we use the SEDA dataset, which maps student achievement alongside different district-level covariates. The Educational Opportunity Project at Stanford University measures educational opportunity in diverse communities all over America. We downloaded the files from the publicly available datasets that the SEDA project hosts online.
For data on academic achievement, we downloaded seda_geodist_pool_cs_4.0.dta, which includes geographic school district achievement estimates. In this dataset, achievement estimates use a cohort standardized (CS) scale based on the cohort that was in the 4th grade in Spring 2009. This CS metric standardizes test performance so that the scores represent the standard deviations away from the national grade-subject-specific performance of this single cohort. For our analysis, we will be specifically using the variable cs_mn_avg_ol to measure district academic achievement. This variable measures district performance on Math/Reading standardized tests and is reported on the CS scale (in standard deviations away from the national average).
Therefore, a district with a cs_mn_avg_ol score of zero is performing in line with the average district performance at the national level. On the other hand, a district with a cs_mn_avg_ol score of -1 is performing 1 standard deviation below the national average, while a district with a cs_mn_avg_ol score of +1 is performing 1 standard deviation above.
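To build intuition for this standard-deviation scale, a quick back-of-the-envelope calculation translates CS scores into rough percentiles. This sketch assumes an approximately normal distribution of scores, which is our simplifying assumption for illustration only:

```r
# Illustrative only: assuming scores on the CS scale are roughly normally
# distributed, pnorm() converts a score in SD units into a percentile.
cs_scores <- c(-1, 0, 1)
percentiles <- round(pnorm(cs_scores) * 100)
percentiles  # -1 SD is near the 16th percentile, 0 the 50th, +1 SD near the 84th
```

This is only a heuristic for reading the scale; the actual distribution of district scores need not be exactly normal.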
For data on the covariates that we will be using to predict academic achievement in this analysis, we downloaded seda_cov_geodist_pool_4.0.dta. This dataset includes over 50 variables illustrating the demographic and socio-economic characteristics of the districts in the United States. These covariates will be further described below.
We use the inner_join() function to join the achievement scores dataset with the covariates dataset, and we subset the data to include only achievement outcomes reported at the district level (not by sub-group) with no missing values in our outcome and key predictors (cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall).
# Load achievement data (district, CS, pooled)
# data obtained from: "https://stacks.stanford.edu/file/druid:db586ns4974/seda_geodist_pool_cs_4.0.dta"
# This data set contains data on the outcome of interest, district academic achievement.
# In this data set, academic achievement is measured in terms of standard deviation
# units and is what we will be using for our regression analyses
ach <- read_dta("Part1_SEDA_achievement.dta")
# Load main covariates data (district level), pool
# data obtained from: "https://stacks.stanford.edu/file/druid:db586ns4974/district%20covariates.dta"
# This data set contains an extensive list of covariates that will be tested as
# predictors of our outcome, academic achievement
covs <- read_dta("Part1_SEDA_covariates.dta")
# Merge files together. This keeps only matched districts.
dat <- inner_join(ach, covs, by = c("sedalea", "fips"))
# Subset to the "all" subgroup (estimates for all students), keeping only districts
# with non-missing average achievement, unemployment, poverty, and SES average.
dat <- filter(dat,
subgroup == "all",
!is.na(cs_mn_avg_ol),
!is.na(unempavgall),
!is.na(povertyavgall),
!is.na(sesavgall))
After creating this data set, we noticed that the covariate data that we imported above does not include variables related to district spending. This is an area of interest of our team that will be explored further in Part 2. Therefore, we obtained an additional SEDA dataset that had district spending included and merged this with the data set above.
#importing the data set that includes variables of interest: ppexp_tot, ppexp_inst, pprev_tot
spending <- read_dta("Part1_SEDA_covariates2.dta")
# Subsetting data set to only include variables of interest
# also including the district ID, titled leaid here, and state ID (fips) for joining
spending <- subset(spending, select = c(ppexp_tot, ppexp_inst, pprev_tot, leaid, fips))
# since the name of the district ID variable in the dataset "spending" (leaid) is different
# from the district ID variable in the data set "dat" (sedalea), we need to change this
spending <- rename(spending, sedalea = leaid)
#checking the class of `sedalea` in "spending" df
class(spending$sedalea)
## [1] "character"
#checking the class of `sedalea` in "dat" df
class(dat$sedalea)
## [1] "numeric"
#converting `sedalea` in "spending" df to numeric class so that they are both of the same class
spending$sedalea <- as.numeric(spending$sedalea)
#confirming that `sedalea` in "spending" is now numeric
class(spending$sedalea)
## [1] "numeric"
# Merge files together. This keeps only matched districts.
dat <- inner_join(dat, spending, by = c("sedalea", "fips"))
#determining the final number of variables and observations in the dataset
ncol(dat)
## [1] 88
nrow(dat)
## [1] 12366
Now we have a dataset for our analysis that includes all of our variables of interest. This dataset includes 88 variables on 12,366 districts.
For the exploratory data analysis, we start by summarizing our outcome of interest: district academic achievement.
To begin our exploratory data analysis on our outcome, it would be helpful to first take a look at how much variation we have in our data, both within states and across districts. We can begin to visualize this variation using box plots, grouped by states.
# creating a box-plot of district academic achievement by state
# we use mutate() with reorder() to order states by median district academic achievement score
dat %>%
  mutate(stateabb = reorder(stateabb, cs_mn_avg_ol, FUN = median)) %>%
  ggplot(aes(stateabb, cs_mn_avg_ol, group = stateabb)) +
  geom_boxplot(fill = "lightsalmon1", outlier.color = "black",
               outlier.fill = "lightsalmon1", outlier.shape = 21) +
  labs(x = "State Abbreviation", y = "District Academic Achievement (Test Scores)") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("District Academic Achievement Across States")
We began with initial exploratory data analysis to visualize the variation of academic achievement between districts in the United States. From the box plot above, we notice substantial variation in test scores both between and within states. Districts in Massachusetts have the highest median test score, performing at about 0.5 standard deviations above the national average. On the other hand, the lowest median academic achievement scores (excluding DC) can be found in New Mexico, where the median score is about 0.4 standard deviations below the national average. However, most of the variation in academic performance appears to come from variation within states. For example, in many states, such as California, you can find districts with both extremely high and extremely low test scores.
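The state-level medians quoted above can be checked by ranking states on their median district score. This is a sketch of that pipeline using a small toy data frame standing in for dat (the column names match; the real data has thousands of districts):

```r
library(dplyr)

# Toy stand-in for `dat`: two districts per state, purely for illustration
toy <- data.frame(
  stateabb     = c("MA", "MA", "NM", "NM", "CA", "CA"),
  cs_mn_avg_ol = c(0.4, 0.6, -0.5, -0.3, -0.9, 0.8)
)

# Rank states by their median district achievement score, highest first
state_medians <- toy %>%
  group_by(stateabb) %>%
  summarise(median_score = median(cs_mn_avg_ol)) %>%
  arrange(desc(median_score))

state_medians
```

Running the same pipeline on the full dat data frame reproduces the ordering shown in the box plot.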
In our analysis, we hope to understand the factors that contribute to such variation in academic achievement between districts across the United States. The covariate dataset includes many potential predictors of district test scores. However, for the purpose of our project, we want to focus our analysis on a few key covariates in order to produce more focused and meaningful conclusions.
Before we begin, we need to specify our list of district-level predictors to choose from. Our data set currently has approximately 60 covariates. However, many of these covariates apply only to specific sub-populations. For example, for average SES, the data set includes a district-level measure as well as measures broken down by racial identity. For this project, we are only interested in the district-level measures.
Our final list of covariates that we will be analyzing is presented below:
- `urban`: district is in a city/urban locale (binary; 0 = no, 1 = yes)
- `suburb`: district is in a suburban locale (binary; 0 = no, 1 = yes)
- `town`: district is in a town locale (binary; 0 = no, 1 = yes)
- `rural`: district is in a rural locale (binary; 0 = no, 1 = yes)
- `perind`: % Native American students in the district
- `perasn`: % Asian students in the district
- `perhsp`: % Hispanic students in the district
- `perblk`: % Black students in the district
- `perwht`: % White students in the district
- `perfl`: % students receiving free lunch
- `perrl`: % students receiving reduced lunch
- `perfrl`: % students receiving free or reduced lunch
- `perecd`: % economically disadvantaged students in the district
- `perell`: % English Language Learners
- `perspeced`: % students who receive Special Education
- `totenrl`: # students in the grade
- `sesavgall`: SES composite score
- `lninc50avgall`: log of median income
- `baplusavgall`: % of adults in district with at least a bachelor's degree
- `povertyavgall`: poverty rate (%)
- `unempavgall`: unemployment rate (%)
- `snapavgall`: SNAP recipient rate (%)
- `single_momavgall`: % of households with children headed by a single mom
- `ppexp_tot`: total per pupil expenditures (dollars)
- `ppexp_inst`: per pupil expenditures on instruction (dollars)
- `pprev_tot`: revenue per pupil (dollars)

#subsetting our data set to only include the variables of interest
# variable of interest include our outcome measures: cs_mn_avg_ol and
# district-level predictors as listed above
dat.ML <- subset(dat, select = c(urban, suburb, town, rural, perind, perasn, perhsp, perblk, perwht, perfl, perrl, perfrl, perecd, perell, perspeced, totenrl, sesavgall, lninc50avgall, baplusavgall, povertyavgall, unempavgall, snapavgall, single_momavgall, ppexp_tot, ppexp_inst, pprev_tot, cs_mn_avg_ol))
In our dataset, many of our covariates have missing values. While in many cases missing values can be ignored, here they presented an issue when running the Random Forest model below. In the manual on Random Forests by Breiman (2002), which is cited in the ?randomForest R documentation, the author notes that the current version of randomForest does not handle missing values. Online forums suggest multiple methods for approaching this issue, from simply dropping the observations with missing values to imputing plausible values. One of the most common approaches is using missForest to impute the data. As stated in the [R Documentation](https://www.rdocumentation.org/packages/missForest/versions/1.4/topics/missForest), missForest can be used to impute continuous and/or categorical data, including in the presence of complex interactions and nonlinear relations.
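Before choosing an imputation approach, it can help to quantify how much missingness each predictor has. The sketch below uses a small toy data frame in place of dat.ML, purely for illustration; the same call works on the real data:

```r
# Toy stand-in for dat.ML, with a few NAs sprinkled in
df <- data.frame(
  perfl     = c(0.31, NA, 0.52, 0.18),
  sesavgall = c(0.90, -0.40, NA, NA)
)

# Count the number of missing values in each column
colSums(is.na(df))
```

On the real data, this makes it easy to see which predictors drive the missingness that randomForest cannot handle.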
We decided that using missForest to impute the missing values in the predictors was the best approach moving forward, but found that running the missForest function on our data took upwards of 1 hour. We anticipated this would be a challenge for others knitting our RMarkdown file. Therefore, for the readers' convenience, we re-upload the dataset with this imputation already complete. Our process of running missForest and then exporting the data frame can still be found below:
#This code chunk is not evaluated, to save the 1hr+ computation time of running missForest when knitting the RMarkdown
#missForest only runs on matrices, so we use as.matrix() on our current data frame
data.ML.imputed <- missForest(as.matrix(dat.ML))
#converting output back to data frame to be used in future analyses
data.ML.imputed.df <- as.data.frame(data.ML.imputed$ximp)
#exporting the data frame with imputed values to be used in the future without having to run missForest each time
write.csv(data.ML.imputed.df, file = "Part1_MachineLearning_Complete.csv", row.names = FALSE)
#uploading the data frame with the missing data in predictors imputed
ML.dat <- read.csv("Part1_MachineLearning_Complete.csv")
Now that the missing values have been imputed, we can move forward with the ML.dat dataset to assess the relative importance of the predictors using a Random Forest model.
The ultimate purpose of using a Random Forest machine learning model is to determine the relative importance of the various covariates in our dataset for predicting district academic achievement, as defined above. While a Random Forest model can rank the relative importance of the predictors using robust methods, it is helpful to first visualize some of these relationships before diving into the analysis.
We can use plots to visualize the relationship between District Academic Achievement (cs_mn_avg_ol) and our continuous predictors (perind, perasn, perhsp, perblk, perwht, perfl, perrl, perfrl, perecd, perell, perspeced, totenrl, sesavgall, lninc50avgall, baplusavgall, povertyavgall, unempavgall, snapavgall, single_momavgall, ppexp_tot, ppexp_inst).
#The 21 scatter plots share an identical structure, so we define a helper
#function and save each plot (p1-p21) for arranging later
make_scatter <- function(xvar, xlabel) {
  ggplot(ML.dat, aes(x = .data[[xvar]], y = cs_mn_avg_ol)) +
    geom_point(size = 0.05, shape = ".", alpha = 0.5, color = "blueviolet") +
    xlab(xlabel) +
    ylab("District Academic Achievement (Test Scores)") +
    theme_light() +
    geom_smooth(se = FALSE, lwd = 0.5, lty = 2, method = "lm",
                formula = y ~ poly(x, 1), color = "black")
}
#Saving a plot of "District Academic Achievement (Test Scores)" vs each predictor
p1  <- make_scatter("perind",           "% Native American Students")
p2  <- make_scatter("perasn",           "% Asian Students")
p3  <- make_scatter("perhsp",           "% Hispanic Students")
p4  <- make_scatter("perblk",           "% Black Students")
p5  <- make_scatter("perwht",           "% White Students")
p6  <- make_scatter("perfl",            "% w/ Free Lunch")
p7  <- make_scatter("perrl",            "% w/ Reduced Lunch")
p8  <- make_scatter("perfrl",           "% w/ Free/Reduced Lunch")
p9  <- make_scatter("perecd",           "% Economically Disadvantaged")
p10 <- make_scatter("perell",           "% English Language Learners")
p11 <- make_scatter("perspeced",        "% Students in Special Ed.")
p12 <- make_scatter("totenrl",          "# Enrolled in District")
p13 <- make_scatter("sesavgall",        "SES Score")
p14 <- make_scatter("lninc50avgall",    "Log Median Income")
p15 <- make_scatter("baplusavgall",     "% Adults in District w/ BA")
p16 <- make_scatter("povertyavgall",    "Poverty Rate (%)")
p17 <- make_scatter("unempavgall",      "Unemployment Rate (%)")
p18 <- make_scatter("snapavgall",       "SNAP Rate (%)")
p19 <- make_scatter("single_momavgall", "% Students w/ Single Mom")
p20 <- make_scatter("ppexp_tot",        "Expenditure PP (USD)")
p21 <- make_scatter("ppexp_inst",       "Instruction Expenditures PP (USD)")
#arranging all of the plots together, 3 plots per page
all.predictors.plot <- ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21,
nrow = 1, ncol = 3)
# the following code prints all our plots on 7 different "pages" in the output
# this allows for optimal visualization of the plots without them being too
# condensed which can make them hard to read
all.predictors.plot[[1]]
all.predictors.plot[[2]]
all.predictors.plot[[3]]
all.predictors.plot[[4]]
all.predictors.plot[[5]]
all.predictors.plot[[6]]
all.predictors.plot[[7]]
For the binary predictors describing the district locale (urban, suburb, town, rural), we can use a box plot to compare their distributions of District Average Test Scores:
# to help create a box plot, we first reshape our data to create one variable
# that indicates whether a district is characterized as urban, suburban, town,
# or rural. These characterizations are all mutually exclusive
boxplot.dta <- ML.dat %>% mutate(locale =
case_when(urban == 1 ~ "Urban",
suburb == 1 ~ "Suburban",
town == 1 ~ "Town",
rural == 1 ~ "Rural"))
#Plotting the new "locale" variable using a boxplot:
ggplot(boxplot.dta, aes(locale, cs_mn_avg_ol, group = locale)) +
geom_boxplot(fill = "cyan2", outlier.color = "black", outlier.fill = "cyan2", outlier.shape = 21) + labs(x = "District Locale", y = "District Academic Achievement (Test Scores)") + theme_light() + ggtitle("District Academic Achievement by Locale")
Our scatter plots and box plot above confirm what we initially suspected: many of these variables do in fact appear to predict academic achievement. For example, baplusavgall, the percent of adults in the district with a bachelor's degree, is positively correlated with academic achievement. On the other hand, the percent of households in poverty or with a single mother appears to be negatively correlated. For other variables, such as per pupil expenditures on instruction (ppexp_inst), the association is less clear. Overall, while these scatter plots give us a general sense of which variables may be important, it is still not clear which variables matter most for academic achievement.
To answer this question, we will turn to the Random Forest Model.
To begin our machine learning exercises, we first must create our training and test sets. Here we use createDataPartition to place 5% of the districts in the training set (p = 0.05) and hold out the remaining 95% as the test set.
set.seed(20)
#creating a data partition that indexes 5% of the observations for the training set
index_train <- createDataPartition(y = ML.dat$cs_mn_avg_ol, times = 1, p = 0.05, list = FALSE)
#labeling the first "slice" as the train set
train_set <- slice(ML.dat, index_train)
#labeling the second "slice" as the test set
test_set <- slice(ML.dat, -index_train)
Next, we will use a Random Forest Model to determine variables of importance, as measured using the Gini Coefficient.
set.seed(1)
#fitting a random forest with all 25 predictors
rf_fit <- randomForest(cs_mn_avg_ol ~ urban + suburb + town + rural + perind +
perasn + perhsp + perblk + perwht +
perfl + perrl + perfrl + perecd + perell +
perspeced + totenrl +
sesavgall + lninc50avgall +
baplusavgall + povertyavgall +
unempavgall + snapavgall +
single_momavgall + ppexp_tot +
ppexp_inst, data = train_set, mtry = 25, importance = TRUE)
The Random Forest model above can be used to produce the following ranking for the Gini Index Decrease. The higher the Gini Index decrease for a variable, the more important that variable is for predicting District Academic Achievement.
#extracting the variable importance from the random forest model above
variable_importance <- importance(rf_fit)
#creating a tibble of variable importance arranged in descending Gini index
variable_importance <- tibble(Predictor = rownames(variable_importance),
Gini = variable_importance[,1]) %>%
arrange(desc(Gini))
#outputting a table of Gini index by predictor
kable(variable_importance[1:25,])
| Predictor | Gini |
|---|---|
| perecd | 31.8322521 |
| baplusavgall | 28.0251515 |
| perrl | 22.9322785 |
| perfrl | 18.7984632 |
| lninc50avgall | 14.9363614 |
| perfl | 13.0046478 |
| perwht | 12.8787563 |
| sesavgall | 11.3589026 |
| snapavgall | 10.6806359 |
| unempavgall | 9.6144278 |
| perind | 9.3554585 |
| povertyavgall | 9.2768206 |
| perell | 8.5732452 |
| single_momavgall | 7.8481056 |
| perblk | 7.2023434 |
| totenrl | 5.2429317 |
| perasn | 5.1297962 |
| ppexp_tot | 4.0886798 |
| perhsp | 4.0187372 |
| ppexp_inst | 3.1537956 |
| perspeced | 2.8949649 |
| suburb | 0.7842414 |
| rural | 0.3602715 |
| town | -0.1114319 |
| urban | -0.8412145 |
We can create a visual representation of this ranking using ggplot as done below:
#Creating labels to use for each of the predictors
labels <- c(urban = "Urban District", suburb = "Suburban District", town = "Town District", rural = "Rural District", perind = "% Native American Students", perasn = "% Asian Students",
perhsp= "% Hispanic Students",
perblk= "% Black Students",
perwht= "% White Students",
perfl= "% Students w/ Free Lunch",
perrl= "% Students w/ Reduced Lunch",
perfrl= "% Students w/ Free or Reduced Lunch",
perecd= "% Students Economically Disadvantaged",
perell= "% English Language Learners",
perspeced= "% Students with Special Ed.",
totenrl= "Number of Students in District",
sesavgall= "SES Composite Score",
lninc50avgall= "Log Median Income",
baplusavgall= "% Adults in District w/ Bachelors(+) degree",
povertyavgall= "Poverty Rate (%)",
unempavgall= "Unemployment Rate (%)",
snapavgall= "SNAP Rate (%)",
single_momavgall = "% Students w/ Single Mom",
ppexp_tot = "Expenditures Per Pupil (USD)",
ppexp_inst = "Instruction Expenditures Per Pupil (USD)",
pprev_tot= "Revenue Per Pupil (USD)"
)
#Adding the labels to the variable_importance data frame
variable_importance$labels <- as.character(labels[variable_importance$Predictor])
#creating a chart of relative importance of the predictors based on Gini index
chart.importance <- ggplot(variable_importance, aes(x = reorder(labels, Gini),
y = Gini)) +
geom_col(width = 0.5, fill = "darkolivegreen3") +
theme_light() + coord_flip() + xlab("Predictor") +
ylab ("Gini Score") + ggtitle("Importance in Predicting Academic Achievement")
chart.importance
From the above ranking of the relative predictive importance of the 25 variables, we can draw some conclusions to inform our next analyses. First, variables describing the percentage of students receiving free or reduced lunch are highly important in the Random Forest model predicting district academic achievement. Socio-economic variables, including the percentage of students categorized as "Economically Disadvantaged" and the percent of adults with at least a bachelor's degree, are among the most important predictors of academic achievement. While not ranked in the top 5, the socio-economic status (SES) composite score, the poverty rate, and the unemployment rate are all still important predictors of academic achievement and are variables of particular interest to our team.
Based on these analyses, our team then decided to explore more in depth the role of some of the socio-economic variables and free/reduced lunch in district academic achievement.
To reiterate what was mentioned above as we begin our next phase of analysis, we use stargazer to observe the mean and standard deviation of each variable, and then review the correlations between unemployment, SES, poverty, and average academic achievement. We see that SES, poverty, and unemployment are all highly correlated with one another, as expected. Furthermore, all three variables are correlated with average academic achievement: SES and average test scores have a correlation of 0.76, unemployment and average academic achievement have a correlation of -0.57, and poverty and average academic achievement have a correlation of -0.67.
# using stargazer to observe the means and standard deviations of SES, poverty,
# unemployment, and academic achievement
dat2 <- dat %>%
group_by(stateabb) %>%
summarise(ses = mean(sesavgall), poverty = mean(povertyavgall), unemployment = mean(unempavgall))
stargazer(data.frame(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)),
type="text")
##
## ==================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ------------------------------------------------------------------
## cs_mn_avg_ol 12,366 0.021 0.334 -2.341 -0.185 0.222 1.248
## unempavgall 12,366 0.060 0.020 0.002 0.047 0.071 0.221
## povertyavgall 12,366 0.126 0.062 0.007 0.081 0.159 0.460
## sesavgall 12,366 0.333 0.846 -4.401 -0.157 0.872 2.910
## ------------------------------------------------------------------
#correlation between outcome and unemployment, poverty, and ses
round(cor(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)), 2)
## cs_mn_avg_ol unempavgall povertyavgall sesavgall
## cs_mn_avg_ol 1.00 -0.57 -0.67 0.76
## unempavgall -0.57 1.00 0.65 -0.75
## povertyavgall -0.67 0.65 1.00 -0.92
## sesavgall 0.76 -0.75 -0.92 1.00
After seeing the descriptive statistics of each variable, we plot histograms of unemployment, SES score, poverty, and average academic achievement to display the distribution of each variable. SES is measured as sesavgall, the SES composite score for all families in the district between 2005-2009 and 2014, measured in standard deviations. For example, a district with a sesavgall score of zero is at the average district socioeconomic status at the national level. A district with a sesavgall score of -1 is 1 standard deviation below the national average, while a district with a sesavgall score of +1 is 1 standard deviation above it.
povertyavgall represents the poverty rate of all families in the district between 2005-2009 and 2014, and unempavgall represents the unemployment rate of all families in the district over the same period. As described in the “Data Source” section above, cs_mn_avg_ol measures the mean district performance on Math/Reading standardized tests and is reported in standard deviations away from the national average. We create histograms and annotate them with the means and standard deviations of each variable. We see that poverty, academic achievement, and SES all have relatively wide spreads, while the distribution of unemployment is more narrow.
# How much variation is there in average academic achievement?
ggplot(aes(x=cs_mn_avg_ol), data = dat) +
geom_histogram(color="black", fill="cyan2", binwidth = 0.05) +
annotate("text", x=-1.8, y=750, label = "Mean=0.021, SD=0.33",
size=5, hjust=0, vjust=0) +
xlab("District Academic Achievement (Test Scores)") +
ylab("Frequency") +
theme_classic()
# How much variation is there in unemployment?
ggplot(aes(x=unempavgall), data = dat) +
geom_histogram(color="black", fill="lightsalmon1", binwidth = 0.005) +
annotate("text", x=0.12, y=1010, label = "Mean=0.06, SD=0.02",
size=5, hjust=0, vjust=0) +
xlab("District Unemployment Rate") +
ylab("Frequency") +
theme_classic()
# How much variation is there in poverty?
ggplot(aes(x=povertyavgall), data = dat) +
geom_histogram(color="black", fill="darkolivegreen3", binwidth = 0.01) +
annotate("text", x=0.21, y=600, label = "Mean = 0.126, SD=0.062",
size=5, hjust=0, vjust=0) +
xlab("District Poverty Rate") +
ylab("Frequency") +
theme_classic()
# How much variation is there in SES?
ggplot(aes(x=sesavgall), data = dat) +
geom_histogram(color="black", fill="blueviolet", binwidth = 0.1) +
annotate("text", x=-3.5, y=500, label = "Mean = 0.33, SD=0.85",
size=5, hjust=0, vjust=0) +
xlab("Average SES") +
ylab("Frequency") +
theme_classic()
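As an aside, the hardcoded mean and SD values in the annotate() labels above can drift out of sync with the data. A small sketch (using a simulated vector in place of a column such as dat$cs_mn_avg_ol) shows how the label string could instead be built from the data itself:

```r
# Sketch: generate the annotation text from the data instead of hardcoding it.
# `x` is simulated here and stands in for a column such as dat$cs_mn_avg_ol.
set.seed(42)
x <- rnorm(1000, mean = 0.02, sd = 0.33)
lab <- sprintf("Mean=%.3f, SD=%.2f", mean(x), sd(x))
# `lab` can then be passed to annotate("text", ..., label = lab)
lab
```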
Now we use ggplot to plot the association between each of the three socioeconomic factors (unemployment, SES score, and poverty) and average academic achievement. We also compute a Pearson correlation coefficient (r) for each pair and annotate it on the plot. Each point represents a district, and the size of each point reflects the total enrollment in grades 3-8 in that district. Lastly, we draw a non-linear smooth that follows the trajectory of the data points. In the first plot, we see a negative association between unemployment and average district academic achievement (r = -0.57). Similarly, in the next plot we observe a negative association between poverty and average district academic achievement (r = -0.67). In the last plot, we see a positive association between SES and average district academic achievement, with the highest correlation coefficient of the three (r = 0.76).
# What is the association between unemployment and average academic achievement?
r1 <- round(cor(dat$cs_mn_avg_ol, dat$unempavgall),2)
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.3, pch=21, color = "black", fill = "lightsalmon1", aes(size = totenrl)) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
annotate("text", x=0.03, y=-1.5, hjust=1, vjust=0,
label = paste0("r=",r1)) +
scale_size(range = c(1,30)) +
guides(size="none") +
ggtitle("Correlation Between Unemployment and Average Academic Achievement") + theme_light() +
xlab("Unemployment Rate (%)") +
ylab("District Academic Achievement (Test Scores)")
# What is the association between poverty and average academic achievement?
r2 <- round(cor(dat$cs_mn_avg_ol, dat$povertyavgall),2)
ggplot(aes(x=povertyavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.3, pch=21, color = "black", fill = "darkolivegreen3", aes(size = totenrl)) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
annotate("text", x=0.05, y=-1.5, hjust=1, vjust=0,
label = paste0("r=",r2)) +
scale_size(range = c(1,30)) +
guides(size="none") +
ggtitle("Correlation Between Poverty and Average Academic Achievement") + theme_light() +
xlab("Poverty Rate (%)") +
ylab("District Academic Achievement (Test Scores)")
# What is the association between SES and average academic achievement?
r3 <- round(cor(dat$cs_mn_avg_ol, dat$sesavgall),2)
ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.3, pch=21, color = "black", fill = "blueviolet", aes(size = totenrl)) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
annotate("text", x=-3, y=0.75, hjust=1, vjust=0,
label = paste0("r=",r3)) +
scale_size(range = c(1,30)) +
guides(size="none") +
ggtitle("Correlation Between SES and Average Academic Achievement") + theme_light() +
xlab("Average SES") +
ylab("District Academic Achievement (Test Scores)")
First, we create boxplots to display the distribution of unemployment rates across all 50 states, ordered by state median. Here, we observe that South Dakota has the lowest median unemployment rate and Mississippi has one of the highest (excluding DC). Montana seems to have unemployment rates similar to Massachusetts's. We note these to select four states to compare in terms of their unemployment-achievement gradient.
# The unemployment distribution between all 50 states ordered by the median
p_unempdist <- dat %>%
mutate(stateabb = reorder(stateabb, unempavgall, FUN = median)) %>%
ggplot(aes(stateabb, unempavgall)) +
geom_boxplot(color = "darkorchid4") +
ylim(0, 0.2) +
ggtitle("Distribution of Unemployment Rates for each state") +
xlab("States") +
ylab("Unemployment Rate (%)") +
theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p_unempdist
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Next, we found that according to the Bureau of Labor Statistics, Nebraska had the lowest unemployment rate as of September 2021, at 2.0%. Washington and Massachusetts had very similar unemployment rates of 4.9% and 5.2%, respectively. Lastly, California had the highest unemployment rate, at 7.5%. Therefore, we also specifically look at the differences in the unemployment-achievement gradient between Nebraska, Washington, Massachusetts, and California. To start, we plot boxplots of unemployment rates for the four states and note the differences in their medians.
# The unemployment distribution between WA vs. MA vs. NE vs. CA
dat %>%
filter(stateabb == "WA" | stateabb == "MA" | stateabb == "NE" | stateabb == "CA") %>%
ggplot() +
geom_boxplot(aes(x= stateabb, y=unempavgall, fill=stateabb), color = "black", outlier.colour = "navy", outlier.size = 1, notch = FALSE) +
scale_y_continuous(trans = "sqrt") +
xlab("State") +
ylab("Unemployment Rate (%)") +
theme_light() + theme(legend.position = "none")
Then, in our plot examining the relationship between unemployment and academic achievement in WA, MA, NE, and CA, we notice that CA has a very wide range of unemployment rates, with its fitted line stretching into higher unemployment values, while Nebraska's line extends into the lowest unemployment rates. All four lines have negative slopes: higher unemployment rates correspond to lower academic achievement scores, and lower unemployment rates to higher scores. Comparing states, the lines for CA and NE have very similar slopes and largely overlap. Massachusetts and Washington have similar ranges of unemployment rates, but the MA line sits at a higher level overall: districts in MA display higher average academic achievement scores (in standard deviation units relative to the national mean) than districts in CA, NE, and WA.
# How does the unemployment gradient compare in WA vs. MA vs. NE vs. CA?
#Filtering the data set to only include WA, MA, NE, CA
dat_wa_ma_ne_ca <- filter(dat, stateabb %in% c("WA","MA", "NE", "CA"))
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_wa_ma_ne_ca) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb), formula = y~x,
data = dat_wa_ma_ne_ca) +
scale_size(range = c(1,30)) +
guides(size="none", color="none") +
labs(fill = "State") +
ggtitle("Academic Achievement by Unemployment", subtitle = "Comparing gradients in WA, MA, NE, and CA") +
xlab("Unemployment Rate (%)") +
ylab("District Academic Achievement (Test Scores)") + theme_light()
#highlighting the differences in the four lines
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_wa_ma_ne_ca, show.legend = FALSE) +
scale_size(range = c(1,30)) +
geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
data = dat_wa_ma_ne_ca) +
ggtitle("Academic Achievement by Unemployment", subtitle = "Comparing gradients in WA, MA, NE, and CA") +
xlab("Unemployment Rate (%)") +
ylab("District Academic Achievement (Test Scores)") +
labs(color = "State") + theme_light()
Finally, we also plot the gradients for Massachusetts, Montana, South Dakota, and Mississippi, the four states noted earlier. Here, South Dakota has a wide range of unemployment rates stretching into low values. MT and SD seem to have similar slopes, while MS has a slope closer to zero. The unemployment-achievement slope is highest in magnitude (most negative) for MA, but again the average academic achievement scores in MA are higher overall at every unemployment rate compared to the other states.
# How does the unemployment gradient compare in MS vs. MA vs. SD vs. MT?
#filtering to Mississippi, Massachusetts, South Dakota, and Montana
dat_ms_ma_sd_mt <- filter(dat, stateabb %in% c("MS","MA", "SD", "MT"))
#plotting the relationships by state
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_ms_ma_sd_mt, show.legend = FALSE) +
scale_size(range = c(1,30)) +
geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
data = dat_ms_ma_sd_mt) +
ggtitle("Academic Achievement by Unemployment", subtitle = "Comparing gradients in MS, MA, SD, and MT") +
xlab("Unemployment Rate (%)") +
ylab("District Academic Achievement (Test Scores)") +
labs(color = "State") + theme_light()
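The slope comparisons above are visual; one way to test them formally (a sketch, not part of the original analysis) is an interaction model: regressing achievement on unemployment, state, and their interaction, where the interaction coefficient estimates the difference in gradients. With the actual data this would be something like lm(cs_mn_avg_ol ~ unempavgall * stateabb, data = dat_ms_ma_sd_mt); below it is illustrated on simulated data:

```r
# Sketch: comparing unemployment-achievement slopes across two states
# with an interaction term (simulated data; "MA" and "MS" used as labels only).
set.seed(3)
toy <- data.frame(
  state = factor(rep(c("MA", "MS"), each = 100)),
  unemp = runif(200, 0.02, 0.15)
)
# MA gets a higher intercept and a steeper (more negative) slope by construction
toy$score <- ifelse(toy$state == "MA",
                    0.3 - 4 * toy$unemp,
                    -0.1 - 1 * toy$unemp) + rnorm(200, sd = 0.1)
fit <- lm(score ~ unemp * state, data = toy)
# the unemp:stateMS coefficient estimates how the MS slope differs from MA's
round(coef(fit), 2)
```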
We note several limitations of using unemployment as the sole covariate: additional variation in achievement is likely driven by factors other than unemployment. For further analyses, we can look at additional covariates of interest that may affect academic achievement, such as income, the percentage of students receiving free lunch, poverty levels, environmental factors, housing, and social policies. We also ask why Massachusetts has substantially higher academic achievement scores at every level of unemployment compared to several other states. Perhaps this is due to the educational system and policies in MA and/or the different programs available to Massachusetts school districts. Other variables that may affect student academic performance in each state include family environments (such as domestic violence and abuse), racism, and other economic and social measures.
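One way to address this concern is a multiple regression that adjusts for several covariates at once; with the actual data, this might look like lm(cs_mn_avg_ol ~ unempavgall + povertyavgall + sesavgall, data = dat). A self-contained sketch on simulated data with correlated covariates (all coefficients are illustrative, not estimates from SEDA):

```r
# Sketch: adjusting for several correlated socioeconomic covariates at once.
# Simulated data; poverty and unemployment are built to correlate with SES.
set.seed(11)
n <- 300
ses     <- rnorm(n)
poverty <- 0.12 - 0.05 * ses + rnorm(n, sd = 0.05)
unemp   <- 0.06 - 0.01 * ses + rnorm(n, sd = 0.02)
score   <- 0.25 * ses - 1.5 * poverty - 2 * unemp + rnorm(n, sd = 0.2)
fit <- lm(score ~ unemp + poverty + ses)
round(coef(fit), 2)  # each coefficient is now adjusted for the other two
```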
Furthermore, we plot academic achievement by SES composite score for MA and WA to see whether the SES-achievement relationship differs from the unemployment-achievement relationship. We see some differences in range and gradient. Looking at the points, Massachusetts and Washington generally overlap, but MA does seem to have slightly higher levels of SES.
# How does the SES gradient compare in WA vs. MA?
#filtering data to Washington and Massachusetts
dat_wa_ma <- filter(dat, stateabb %in% c("WA","MA"))
#plotting the relationships by state
ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_wa_ma) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb), formula = y~x,
data = dat_wa_ma) +
scale_size(range = c(1,30)) +
guides(size="none", color="none") +
labs(fill = "State") +
ggtitle("How does the SES-achievement gradient differ in WA and MA?") +
xlab("Average SES") +
ylab("District Academic Achievement (Test Scores)") + theme_light()
Massachusetts and Washington may have similar unemployment rates but may look different on other measures, such as housing, poverty, or income levels. Therefore, we want to consider other covariates, such as the percentage of students receiving free lunch, an indicator that our earlier regression tree included as a predictor. As noted above, the random forest model ranks the percentage of students receiving free lunch among the most important predictors of academic achievement.